
Add LpNormalization support for CUDA Execution Provider#28724

Open
apsonawane wants to merge 4 commits into main from asonawane/lp


@apsonawane
Contributor

This pull request adds CUDA (GPU) support for the LpNormalization ONNX operator in ONNX Runtime, including implementation, kernel registration, and new unit tests (notably for FP16). The main changes involve adding the CUDA kernel, wiring it up for opsets 1–22, and extending the test suite to cover new scenarios and datatypes.

CUDA LpNormalization Operator Support:

  • Implemented a CUDA kernel for LpNormalization that supports the float, double, and MLFloat16 data types and covers both L1 (p = 1) and L2 (p = 2) normalization.
  • Registered the CUDA kernel for LpNormalization for opsets 1–21 (versioned) and opset 22 (current), for all supported data types (float, double, MLFloat16).
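For readers unfamiliar with the operator, the following is a minimal host-side sketch of what LpNormalization computes along one axis of a contiguous row-major tensor. It is not the code from this PR: `LpNormalizeReference` is a hypothetical name, the strided base-index formula is an assumption about how each normalized slice can be addressed, and writing zeros for a zero-norm slice is an assumed behavior chosen for the sketch.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical reference helper (illustration only, not the ONNX Runtime kernel).
// Normalizes row-major tensor `x` of shape `dims` along `axis` with p = 1 or p = 2.
std::vector<float> LpNormalizeReference(const std::vector<float>& x,
                                        const std::vector<int64_t>& dims,
                                        int64_t axis, int p) {
  assert(p == 1 || p == 2);
  const int64_t rank = static_cast<int64_t>(dims.size());
  if (axis < 0) axis += rank;  // negative axes count from the end
  assert(axis >= 0 && axis < rank);

  const int64_t norm_size = dims[axis];  // length of each normalized slice
  int64_t stride = 1;                    // distance between slice elements
  for (int64_t d = axis + 1; d < rank; ++d) stride *= dims[d];

  std::vector<float> y(x.size(), 0.0f);
  if (x.empty() || norm_size == 0) return y;  // nothing to normalize
  const int64_t num_slices = static_cast<int64_t>(x.size()) / norm_size;

  for (int64_t i = 0; i < num_slices; ++i) {
    // First element of slice i; later elements are `stride` apart.
    const int64_t base = (i / stride) * stride * norm_size + (i % stride);
    float acc = 0.0f;
    for (int64_t j = 0; j < norm_size; ++j) {
      const float v = x[base + j * stride];
      acc += (p == 1) ? std::fabs(v) : v * v;
    }
    const float norm = (p == 1) ? acc : std::sqrt(acc);
    if (norm == 0.0f) continue;  // assumed behavior: zero-norm slice stays all zeros
    for (int64_t j = 0; j < norm_size; ++j) {
      y[base + j * stride] = x[base + j * stride] / norm;
    }
  }
  return y;
}
```

For example, normalizing the values 3 and 4 along the last axis with p = 2 divides both by 5, and with p = 1 divides both by 7. A GPU version would typically assign one slice to one thread block and replace the inner accumulation loop with a parallel reduction.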

Testing and Validation:

  • Added new unit tests for LpNormalization covering FP16, various axes, and both L1 and L2 normalization. The tests verify CUDA kernel correctness and exclude providers that do not support the operator.
  • Updated backend test filters to reflect the current status of LpNormalization-related tests.

These changes collectively enable and validate GPU-accelerated LpNormalization in ONNX Runtime for a wide range of models and datatypes.

Contributor

Copilot AI left a comment


Pull request overview

This PR adds a CUDA implementation of the ONNX LpNormalization operator to ONNX Runtime (opsets 1–22), and extends the unit tests and backend-test filters to validate/track the new support (including FP16 scenarios).

Changes:

  • Added CUDA kernel implementation for LpNormalization (float/double/MLFloat16) and wired it into the CUDA EP kernel registration for opsets 1–22.
  • Added new unit tests covering FP16 and additional axis scenarios.
  • Updated ONNX backend test series filters to narrow the currently-skipped l2normalization zero-norm cases.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| onnxruntime/test/testdata/onnx_backend_test_series_filters.jsonc | Refines skipped l2normalization backend tests to specific zero-norm cases. |
| onnxruntime/test/providers/cpu/nn/lp_norm_op_test.cc | Adds new LpNormalization tests (including FP16 and additional axes). |
| onnxruntime/core/providers/cuda/nn/lp_norm.h | Introduces CUDA kernel class wrapper for LpNormalization. |
| onnxruntime/core/providers/cuda/nn/lp_norm.cc | Implements CUDA kernel registration + ComputeInternal calling into the CUDA impl. |
| onnxruntime/core/providers/cuda/nn/lp_norm_impl.h / .cu | Adds CUDA device implementation for L1/L2 normalization. |
| onnxruntime/core/providers/cuda/cuda_execution_provider.cc | Registers the new LpNormalization CUDA kernels for opsets 1–22. |


Comment thread onnxruntime/core/providers/cuda/nn/lp_norm_impl.cu
Comment thread onnxruntime/core/providers/cuda/nn/lp_norm_impl.cu
Comment thread onnxruntime/core/providers/cuda/nn/lp_norm_impl.cu Outdated
Comment thread onnxruntime/core/providers/cuda/nn/lp_norm_impl.cu Outdated
Comment thread onnxruntime/core/providers/cuda/nn/lp_norm.cc
Comment thread onnxruntime/test/providers/cpu/nn/lp_norm_op_test.cc Outdated
Comment thread onnxruntime/test/providers/cpu/nn/lp_norm_op_test.cc Outdated
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Comment thread onnxruntime/core/providers/cuda/nn/lp_norm_impl.cu Outdated
Comment thread onnxruntime/core/providers/cuda/nn/lp_norm_impl.cu Outdated
Comment thread onnxruntime/core/providers/cuda/nn/lp_norm_impl.cu Outdated
Contributor

@tianleiwu tianleiwu left a comment


Solid, well-scoped addition of the CUDA LpNormalization kernel. The element indexing exactly mirrors the existing CPU kernel, the block reduction uses a power-of-two thread count via NextPowerOfTwo, and FP16 accumulation is correctly done in float via AccumulationType_t<T>. The major concerns from earlier review rounds (non-power-of-two reduction, FP16 accumulation overflow, empty-tensor division-by-zero, and test EP setup) are all addressed in the current head.
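To illustrate why the power-of-two thread count matters, here is a host-side C++ simulation of a shared-memory tree reduction. It is only a sketch: `NextPowerOfTwoSketch` and `SimulatedBlockSum` are hypothetical names standing in for the real helpers, the `max_threads` cap of 256 is an assumption, and the loops over `t` stand in for threads that would run in parallel on the GPU.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the NextPowerOfTwo helper named in the review.
inline uint32_t NextPowerOfTwoSketch(uint32_t n) {
  uint32_t p = 1;
  while (p < n) p <<= 1;
  return p;
}

// Simulates one thread block summing `values` with a halving tree reduction.
// `max_threads` is an assumed cap and must itself be a power of two.
float SimulatedBlockSum(const std::vector<float>& values, uint32_t max_threads = 256) {
  if (values.empty()) return 0.0f;
  const uint32_t n = static_cast<uint32_t>(values.size());
  uint32_t threads = NextPowerOfTwoSketch(n);
  if (threads > max_threads) threads = max_threads;
  assert((threads & (threads - 1)) == 0);  // halving below relies on this

  // Phase 1: each simulated thread accumulates a strided partial sum in float.
  std::vector<float> shared(threads, 0.0f);
  for (uint32_t t = 0; t < threads; ++t) {
    for (uint32_t j = t; j < n; j += threads) shared[t] += values[j];
  }

  // Phase 2: tree reduction. With a non-power-of-two `threads`, the halving
  // would skip the last slot whenever the active width is odd.
  for (uint32_t s = threads / 2; s > 0; s >>= 1) {
    for (uint32_t t = 0; t < s; ++t) shared[t] += shared[t + s];
  }
  return shared[0];
}
```

With three input values the simulated block uses four slots, one of which stays at zero, so the halving steps 2 and 1 cover every partial sum. Keeping the accumulator in `float` even when the tensor element type is FP16 is the same idea the review attributes to `AccumulationType_t<T>`.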

A couple of minor, non-blocking suggestions are left inline. One clarification on the prior automated review: the zero-norm branch writing 0 actually matches the ORT CPU kernel (yVec.setZero()); both diverge from the ONNX spec equally, which is why the corresponding backend tests remain in current_failing_tests. So the kernel is consistent with the CPU path here.

Remaining items:

  • A coverage gap: the FP16 tests use norm_size = 3 with small magnitudes, so they pass with or without the accumulation fix. A test with a longer axis (128–256 elements) and larger magnitudes would actually exercise the float-accumulation overflow protection.
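The arithmetic behind this suggestion can be sketched on the host. The snippet below is an illustration under stated assumptions, not a proposed test: the helper names are hypothetical, the inputs (256 elements of magnitude 100) are made-up example values, and 65504 is the largest finite IEEE 754 binary16 value.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

constexpr float kHalfMax = 65504.0f;  // largest finite IEEE 754 binary16 value

// Hypothetical helper: true if the L2 sum of squares, accumulated in float,
// is larger than anything an FP16 accumulator could hold.
bool SumOfSquaresExceedsHalfRange(const std::vector<float>& values) {
  float acc = 0.0f;
  for (float v : values) acc += v * v;
  return acc > kHalfMax;
}

// Hypothetical helper: expected L2-normalized value when every element of the
// axis has the same magnitude, i.e. magnitude / sqrt(norm_size * magnitude^2).
float ExpectedUniformL2Output(std::size_t norm_size, float magnitude) {
  const float sum_sq = static_cast<float>(norm_size) * magnitude * magnitude;
  return magnitude / std::sqrt(sum_sq);
}
```

With three small values such as 0.5, 1, and 2 the sum of squares is 5.25, far inside FP16 range, so an FP16 accumulator would give the same answer. With 256 elements of magnitude 100, each square (10000) still fits in FP16 but the sum (2,560,000) does not, while the expected output 100 / 1600 = 0.0625 is exactly representable. Inputs of that shape are one way a test could distinguish float accumulation from FP16 accumulation.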

Comment thread onnxruntime/core/providers/cuda/nn/lp_norm_impl.cu Outdated
Comment thread onnxruntime/core/providers/cuda/nn/lp_norm_impl.cu Outdated
Comment thread onnxruntime/core/providers/cuda/nn/lp_norm.cc Outdated
