Add LpNormalization support for CUDA Execution Provider #28724
apsonawane wants to merge 4 commits
Pull request overview
This PR adds a CUDA implementation of the ONNX LpNormalization operator to ONNX Runtime (opsets 1–22), and extends the unit tests and backend-test filters to validate/track the new support (including FP16 scenarios).
Changes:
- Added CUDA kernel implementation for `LpNormalization` (float/double/MLFloat16) and wired it into the CUDA EP kernel registration for opsets 1–22.
- Added new unit tests covering FP16 and additional axis scenarios.
- Updated ONNX backend test series filters to narrow the currently-skipped `l2normalization` zero-norm cases.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| onnxruntime/test/testdata/onnx_backend_test_series_filters.jsonc | Refines skipped l2normalization backend tests to specific zero-norm cases. |
| onnxruntime/test/providers/cpu/nn/lp_norm_op_test.cc | Adds new LpNormalization tests (including FP16 and additional axes). |
| onnxruntime/core/providers/cuda/nn/lp_norm.h | Introduces CUDA kernel class wrapper for LpNormalization. |
| onnxruntime/core/providers/cuda/nn/lp_norm.cc | Implements CUDA kernel registration + ComputeInternal calling into the CUDA impl. |
| onnxruntime/core/providers/cuda/nn/lp_norm_impl.h / .cu | Adds CUDA device implementation for L1/L2 normalization. |
| onnxruntime/core/providers/cuda/cuda_execution_provider.cc | Registers the new LpNormalization CUDA kernels for opsets 1–22. |
tianleiwu left a comment
Solid, well-scoped addition of the CUDA LpNormalization kernel. The element indexing exactly mirrors the existing CPU kernel, the block reduction uses a power-of-two thread count via NextPowerOfTwo, and FP16 accumulation is correctly done in float via AccumulationType_t<T>. The major concerns from earlier review rounds (non-power-of-two reduction, FP16 accumulation overflow, empty-tensor division-by-zero, and test EP setup) are all addressed in the current head.
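To illustrate why the power-of-two thread count matters for a halving-style block reduction, here is a minimal CPU simulation in C++. It is a sketch under assumptions, not the PR's kernel: `next_power_of_two` and `block_reduce_sum` are hypothetical names, the `shared` vector stands in for CUDA shared memory, and the inner loops stand in for threads running in lockstep.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: smallest power of two that is >= n (requires n >= 1).
inline std::size_t next_power_of_two(std::size_t n) {
  assert(n >= 1);
  std::size_t p = 1;
  while (p < n) p <<= 1;
  return p;
}

// CPU simulation of a shared-memory tree reduction over `threads` slots.
// Each simulated thread first sums a strided slice of the input in float,
// then the partial sums are combined by repeatedly halving the stride.
inline float block_reduce_sum(const std::vector<float>& values,
                              std::size_t threads) {
  // The halving loop below only visits every slot if `threads` is a power of two.
  assert(threads >= 1 && (threads & (threads - 1)) == 0);

  std::vector<float> shared(threads, 0.0f);
  for (std::size_t t = 0; t < threads; ++t) {
    for (std::size_t i = t; i < values.size(); i += threads) {
      shared[t] += values[i];
    }
  }
  for (std::size_t stride = threads / 2; stride > 0; stride >>= 1) {
    for (std::size_t t = 0; t < stride; ++t) {
      shared[t] += shared[t + stride];
    }
  }
  return shared[0];
}
```

With 6 slots instead of 8, the stride would go 3, 1, 0, so slot 2 would never be folded into slot 0 and its partial sum would be silently dropped. Rounding the thread count up to a power of two avoids that case.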
A couple of minor, non-blocking suggestions are left inline. One clarification on the prior automated review: the zero-norm branch writing 0 actually matches the ORT CPU kernel (yVec.setZero()); both diverge from the ONNX spec equally, which is why the corresponding backend tests remain in current_failing_tests. So the kernel is consistent with the CPU path here.
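The zero-norm behaviour described above can be sketched with a small CPU reference in C++. This is an illustrative assumption of how such a reference could look, not code from the PR or from the ORT CPU kernel: `lp_normalize` is a hypothetical name, and the tensor is treated as a flat buffer viewed as `[outer, axis_size, inner]`.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Normalizes `x`, viewed as [outer, axis_size, inner], along the middle axis.
// `p` must be 1 (L1) or 2 (L2). Slices whose norm is zero are written as zeros.
inline std::vector<float> lp_normalize(const std::vector<float>& x,
                                       std::size_t outer,
                                       std::size_t axis_size,
                                       std::size_t inner,
                                       int p) {
  assert(p == 1 || p == 2);
  assert(x.size() == outer * axis_size * inner);

  std::vector<float> y(x.size(), 0.0f);
  for (std::size_t o = 0; o < outer; ++o) {
    for (std::size_t i = 0; i < inner; ++i) {
      // First element of this slice; consecutive elements are `inner` apart.
      const std::size_t base = o * axis_size * inner + i;

      float acc = 0.0f;
      for (std::size_t k = 0; k < axis_size; ++k) {
        const float v = x[base + k * inner];
        acc += (p == 1) ? std::fabs(v) : v * v;
      }
      const float norm = (p == 1) ? acc : std::sqrt(acc);

      // Zero-norm slice: keep the zeros already in `y` and skip the division.
      if (norm == 0.0f) continue;

      for (std::size_t k = 0; k < axis_size; ++k) {
        y[base + k * inner] = x[base + k * inner] / norm;
      }
    }
  }
  return y;
}
```

For example, `{3, 4, 0, 0}` viewed as two rows of length 2 with `p = 2` yields approximately `{0.6, 0.8, 0, 0}`: the first row is divided by its norm of 5, and the all-zero second row is left as zeros. An empty tensor never reaches the division because no slice is visited.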
Remaining items:
- A coverage gap: the FP16 tests use `norm_size = 3` with small magnitudes and pass with or without the accumulation fix; a test with a longer axis (128–256) and larger magnitudes would actually validate the `float`-accumulation overflow protection.
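One way to pick test inputs that exercise the overflow path is to check whether the L2 sum of squares exceeds the FP16 range. The C++ sketch below is an assumption-laden illustration rather than a proposed test: `l2_sum_overflows_fp16` is a hypothetical helper, it uses a constant-valued axis for simplicity, and it compares a float sum against 65504 (the largest finite FP16 value) instead of emulating half-precision arithmetic.

```cpp
#include <cassert>
#include <cstddef>

// Largest finite value representable in IEEE 754 binary16 (FP16).
constexpr float kFp16Max = 65504.0f;

// Returns true when the sum of squares of `n` elements, each equal to
// `magnitude`, exceeds the FP16 range. The sum itself is computed in float.
inline bool l2_sum_overflows_fp16(std::size_t n, float magnitude) {
  assert(n > 0);
  float acc = 0.0f;
  for (std::size_t k = 0; k < n; ++k) {
    acc += magnitude * magnitude;
  }
  return acc > kFp16Max;
}
```

Under these assumptions, an axis of 3 elements of value 2 gives a sum of 12 and stays well inside the FP16 range, so it cannot distinguish the two accumulation strategies. An axis of 256 elements of value 16 gives 256 × 256 = 65536, which is past the FP16 maximum even though each individual square fits. With float accumulation the expected L2 output for that input is 16 / 256 = 0.0625 per element.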
This pull request adds CUDA (GPU) support for the `LpNormalization` ONNX operator in ONNX Runtime, including implementation, kernel registration, and new unit tests (notably for FP16). The main changes involve adding the CUDA kernel, wiring it up for opsets 1–22, and extending the test suite to cover new scenarios and datatypes.
CUDA LpNormalization Operator Support:
- CUDA kernel for `LpNormalization` supporting float, double, and MLFloat16 datatypes, with efficient handling for both L1 and L2 normalization.
- Kernel registration of `LpNormalization` for opsets 1–21 (versioned) and opset 22 (current), for all supported datatypes (`float`, `double`, `MLFloat16`).
Testing and Validation:
- Unit tests for `LpNormalization` covering FP16, various axes, and both L1/L2 normalization, ensuring CUDA kernel correctness and excluding unsupported providers.
These changes collectively enable and validate GPU-accelerated LpNormalization in ONNX Runtime for a wide range of models and datatypes.