# CUDA Plugin Cleanup for Shared Kernel Helpers (microsoft#27915)
## Description
This PR reduces the amount of CUDA plugin-specific compatibility code by
moving reusable validation and attribute-reading logic into shared
helper paths that work for both bundled and plugin builds. It also fills
in a missing allocator hook in the EP adapter so plugin kernels can
reuse the same initialization path as the in-tree CUDA EP, which
simplifies maintenance and improves behavior parity. The follow-up
changes update the CUDA plugin design doc to reflect the new
shared-helper model and add focused plugin regression tests for the two
runtime paths that changed most materially.
## Summary of Changes
### EP adapter and shared helper extraction
| File | Change |
|------|--------|
| `ep/adapter/op_kernel_info.h` | Adds
`OpKernelInfo::GetAllocator(OrtMemType)` so adapter-based kernels can
request device or CPU temp allocators in plugin builds. |
| `cpu/tensor/scatter_nd.h` | Extracts shape validation into
`scatter_nd_internal::ValidateShapes` so the same logic can be reused
outside the CPU `ScatterND` class. |
| `cpu/tensor/space_depth_ops.h` | Moves blocksize parsing, mode
parsing, and dimension validation into `space_depth_internal` helpers
that can be shared by CUDA kernels. |
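The adapter-side allocator hook can be pictured with a minimal mock. Everything below is illustrative: the real `ep::adapter::OpKernelInfo` resolves allocators through the ORT C API, and the `Allocator` struct here is a hypothetical stand-in, not an ORT type.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Illustrative stand-ins for ORT types; the real adapter goes through the ORT C API.
enum OrtMemType { OrtMemTypeCPU = 0, OrtMemTypeDefault = 1 };

struct Allocator {  // hypothetical minimal allocator handle
  OrtMemType mem_type;
};

// Sketch of the adapter OpKernelInfo gaining a GetAllocator(OrtMemType) hook,
// so a kernel can request either the device allocator or a CPU scratch
// allocator through one call in both bundled and plugin builds.
class OpKernelInfo {
 public:
  OpKernelInfo(std::shared_ptr<Allocator> device, std::shared_ptr<Allocator> cpu)
      : device_(std::move(device)), cpu_(std::move(cpu)) {}

  // OrtMemTypeDefault maps to the device allocator; anything else to CPU here.
  std::shared_ptr<Allocator> GetAllocator(OrtMemType mem_type) const {
    return mem_type == OrtMemTypeDefault ? device_ : cpu_;
  }

 private:
  std::shared_ptr<Allocator> device_;
  std::shared_ptr<Allocator> cpu_;
};
```

The point of the sketch is the call shape: a kernel written against `OpKernelInfo::GetAllocator(OrtMemType)` no longer needs a plugin-only fallback for buffer allocation.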
### CUDA kernel cleanup and plugin parity
| File | Change |
|------|--------|
| `cuda/tensor/scatter_nd.cc` | Removes the plugin-only `ScatterND`
validation duplicate and reuses the shared helper implementation. |
| `cuda/tensor/scatter_nd.h` | Drops the old conditional include split
now that validation is shared through the common helper path. |
| `cuda/tensor/space_depth_ops.h` | Deletes the plugin-only
`SpaceToDepth`/`DepthToSpace` reimplementation and inherits from the
shared base/helper logic in all builds. |
| `cuda/tensor/upsample.cc` | Reuses the normal antialias lookup-table
allocation/caching path in plugin builds via the new allocator adapter
support. |
| `cuda/tensor/upsample.h` | Keeps the persistent device lookup-table
member available in plugin builds as well. |
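The shared-helper pattern behind the `ScatterND` change can be sketched as a free function that both the CPU and CUDA kernel classes call. The signature below is illustrative only (the actual ORT helper works on `TensorShape` and returns a `Status`); the shape rule itself follows the ONNX `ScatterND` spec: with `k = indices.shape[-1]`, `updates.shape` must equal `indices.shape[:-1]` concatenated with `data.shape[k:]`.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch of extracting ScatterND shape validation into a shared namespace so
// CPU and CUDA kernels reuse one implementation instead of keeping copies.
namespace scatter_nd_internal {

bool ValidateShapes(const std::vector<int64_t>& data_shape,
                    const std::vector<int64_t>& indices_shape,
                    const std::vector<int64_t>& updates_shape) {
  if (data_shape.empty() || indices_shape.empty()) return false;
  const int64_t k = indices_shape.back();  // index tuple length
  const int64_t r = static_cast<int64_t>(data_shape.size());
  if (k < 1 || k > r) return false;
  // Expected: indices.shape[:-1] ++ data.shape[k:]
  std::vector<int64_t> expected(indices_shape.begin(), indices_shape.end() - 1);
  expected.insert(expected.end(), data_shape.begin() + k, data_shape.end());
  return updates_shape == expected;
}

}  // namespace scatter_nd_internal
```

Because the logic is a plain function on shapes, it compiles the same way in bundled and plugin builds, which is what removes the plugin-only duplicate in `cuda/tensor/scatter_nd.cc`.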
### Shared-provider and diagnostics alignment
| File | Change |
|------|--------|
| `cpu/cpu_provider_shared.cc` | Routes shared-provider `ScatterND`
shape validation through the extracted helper. |
| `provider_bridge_provider.cc` | Updates the bridge-side
`ScatterND::ValidateShapes` implementation to call the shared helper
directly. |
| `cuda/cudnn_common.h` | Preserves the batch-norm epsilon warning path
in plugin builds instead of suppressing it. |
| `cuda/nn/conv.cc` | Removes plugin-specific shortened cuDNN frontend
errors so bundled and plugin builds both include frontend JSON in
failures. |
| `cuda/nn/conv_transpose.cc` | Extends cuDNN frontend failures to
include frontend JSON for easier debugging, matching the `Conv`
behavior. |
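The diagnostics-parity idea is simply that both build flavors attach the full cuDNN frontend JSON to a failure instead of a shortened plugin-only message. A minimal sketch, with a hypothetical helper name and message format (ORT's real error path uses its `Status`/logging machinery):

```cpp
#include <cassert>
#include <string>

// Sketch: build one failure message that carries the op name, the status text,
// and the full frontend graph JSON so plugin and bundled builds debug alike.
std::string MakeCudnnFrontendError(const std::string& op,
                                   const std::string& status_msg,
                                   const std::string& frontend_json) {
  return "[" + op + "] cuDNN frontend failure: " + status_msg +
         "\nGraph JSON: " + frontend_json;
}
```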
### Documentation and regression coverage
| File | Change |
|------|--------|
| `cuda_plugin_ep_design.md` | Updates the design doc to reflect that
`ScatterND`, `SpaceDepth`, and `Upsample` now use shared adapter-safe
helper paths instead of plugin-only fallback branches. |
| `test_cuda_plugin_ep.py` | Adds plugin regression coverage for
antialias `Resize`/`Upsample` and `ScatterND`, covering the new
allocator-backed lookup-table path and the shared `ScatterND` validation
helper. |
## Testing
- Build with `onnxruntime_BUILD_CUDA_EP_AS_PLUGIN=ON` and verify the
affected CUDA provider sources compile without the removed plugin-only
fallback paths.
- Run targeted CUDA provider coverage for `ScatterND`,
`SpaceToDepth`/`DepthToSpace`, `Resize`/`Upsample`, `Conv`, and
`ConvTranspose` in both plugin and bundled CUDA configurations.
- Confirm antialias upsample still initializes and uses the shared
lookup table correctly in plugin builds.
- Run the new plugin tests for antialias `Resize` and `ScatterND` in
`onnxruntime/test/python/transformers/test_cuda_plugin_ep.py`.
- Confirm cuDNN frontend failure paths now emit the same diagnostic
detail in plugin and non-plugin builds.
## Motivation and Context
The initial CUDA plugin enablement introduced several localized `#ifdef
BUILD_CUDA_EP_AS_PLUGIN` branches and helper copies to get kernels
compiling under the adapter path. This cleanup pays down that
compatibility debt by extracting the truly shared pieces into reusable
helpers and by teaching the adapter `OpKernelInfo` how to provide the
allocators those kernels already expect. The result is less duplicated
logic, fewer plugin-only code paths to keep in sync, and better
debugging consistency between the plugin EP and the built-in CUDA EP.
## Checklist
- [x] Tests added/updated
- [x] Documentation updated (if applicable)
- [x] No breaking changes (or documented in description)
### Design doc diff (`cuda_plugin_ep_design.md`)

```diff
 - `crop.h` — `CropBase` constructor (templatized on info type)
-- `space_depth_ops.h` — `SpaceDepthBase` constructor (templatized on info type)
+- `space_depth_ops.h` — `SpaceDepthBase` constructor plus shared `ReadBlocksize`, `ReadIsDCR`, and dimension-validation helpers (templatized on info/context type where needed)
 - `clip.h` — Clip min/max attribute handling (removed `Clip_6Base` CPU dependency)
 - `cuda_common_type_helpers.h` — CUDA type conversion and handle error string helpers (moved from `cuda_common.cc`)
@@ -253,7 +253,8 @@ This allows the base class constructor to work with both the framework `OpKernel
 Some CPU base classes have heavy dependencies (protobuf, `UnpackTensor`) that make inlining impractical:

 **`ConstantOfShapeBase`** — depends on `TensorProto` and `UnpackTensor`. The plugin path in `constant_of_shape.h` stays self-contained: it reuses `ConstantOfShapeCore` but fetches the `value` attribute through the ORT C++ API instead of depending on the full CPU base implementation.

-**`UpsampleBase`** — partially addressed: `AdjustOutputSizeAsPolicy` moved to header (#27628). Still depends on `InputDefs()` and `OpKernelInfo::GetAllocator()` which are not in the adapter.
+**`UpsampleBase`** no longer belongs in this category: the adapter now exposes `OpKernelInfo::GetAllocator(OrtMemType)`, and the remaining shape-rank query already has an adapter-safe fallback when `Node::InputDefs()` is unavailable. That lets the CUDA `Upsample` antialias path reuse the same persistent device lookup-table initialization in both bundled and plugin builds instead of keeping a plugin-only scratch-buffer fallback.

 ---
@@ -619,7 +620,7 @@ The branch still contains a small set of plugin guards in both infrastructure an
 - `generator/constant_of_shape.h` still needs a plugin-specific path because `ConstantOfShapeBase` depends on framework-only tensor-attribute helpers.
 - Tunable kernels such as `math/matmul.cc` still gate framework-only registration paths.
 - `tensor/identity_op.h` guards the `TensorSeq` code path and `context->InputType()` call with `#ifndef BUILD_CUDA_EP_AS_PLUGIN` — the plugin build handles only the `Tensor` path. `identity_op.cc` uses conditional macros (`IDENTITY_V_TYPES` / `IDENTITY_V_TYPES_IRv9`) so opset 14+ registrations use `AllFixedSizeTensorTypes()` in the plugin build. Additionally, old Dropout opset 7–9 and 10–11 kernel registrations were moved from `identity_op.cc` to `nn/dropout.cc` so that each op's registrations live in that op's own source file.
-- A few tensor kernels (`pad.cc`, `tile.cc`, `unsqueeze.cc`, `upsample.*`, `space_depth_ops.h`, `scatter_nd.*`) still contain localized plugin guards where adapter and framework paths have not fully converged.
+- A few tensor kernels (`pad.cc`, `tile.cc`, `unsqueeze.cc`) still contain localized plugin guards where adapter and framework paths have not fully converged. Recent cleanup removed the plugin-only branches from `upsample.*`, `space_depth_ops.h`, and `scatter_nd.*` by moving reusable logic into shared adapter-safe helpers and by adding allocator access to `ep::adapter::OpKernelInfo`.

 The broad trend remains positive: most operator-level plugin conditionals were removed by moving reusable CPU/helper logic into shared headers and by centralizing stream bridging in `CudaKernel` helpers.
```