Fix GroupQueryAttention right-padded rotary prefill CUDA test (#29218)
### Description
The `GroupQueryAttentionTest.BatchedRightPaddedRotaryPrefill_CUDA` test
(added in #29002) fed **fp32** inputs via `AddInput<float>`. The CUDA
(and WebGPU) GroupQueryAttention kernels only register for
`MLFloat16`/`BFloat16`, so the fp32 node silently fell back to the **CPU
EP** — the `_CUDA` test never actually exercised the CUDA kernel it is
named for. This surfaced as a CI failure on the CUDA test leg after
#29002 and #29046 merged.
This PR makes `RunGQAPackedQKVRotaryPrefill` feed **fp16** tensors when
targeting CUDA EP, matching the existing `RunGQASharedKVFp16` convention
and the test's own "loose enough for fp16 rounding" tolerance. The CPU
code path is unchanged.
### Key Changes
- `RunGQAPackedQKVRotaryPrefill` now branches on the target EP:
  - CUDA EP: inputs/outputs use `MLFloat16` (converted via `ToFloat16`),
    so the node is placed on the real GPU kernel.
  - WebGPU/CPU EP: unchanged (`float`).
- Output is converted back to `float` for the existing comparison logic.
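
The branching pattern can be illustrated with a minimal, self-contained sketch. It is not the code in this PR: `TargetEp`, `RoundToHalfPrecision`, `PrepareInputs`, and `MaxRelativeError` are hypothetical names, and fp16 storage is only emulated by rounding a `float` to 11 significant bits (the real test uses `MLFloat16` and `ToFloat16`). The sketch assumes values in the normal fp16 range and ignores overflow and subnormals.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the execution provider a test targets.
enum class TargetEp { kCpu, kCuda, kWebGpu };

// Emulates an fp32 -> fp16 -> fp32 round trip by keeping 11 significant bits.
// Simplification (assumption): fp16 overflow and subnormals are not modeled.
float RoundToHalfPrecision(float x) {
  if (x == 0.0f || !std::isfinite(x)) return x;
  int exponent = 0;
  const float mantissa = std::frexp(x, &exponent);  // |mantissa| in [0.5, 1)
  // Scale to [1024, 2048], round to the nearest integer (ties to even by default).
  const float rounded = std::nearbyint(std::ldexp(mantissa, 11));
  return std::ldexp(rounded, exponent - 11);
}

// Hypothetical helper: the values a kernel would see for the given EP.
// Only the CUDA branch goes through half precision; other EPs keep fp32.
std::vector<float> PrepareInputs(const std::vector<float>& data, TargetEp ep) {
  if (ep != TargetEp::kCuda) return data;
  std::vector<float> out;
  out.reserve(data.size());
  for (float v : data) out.push_back(RoundToHalfPrecision(v));
  return out;
}

// Largest relative difference between a reference vector and a candidate,
// skipping reference entries that are exactly zero.
float MaxRelativeError(const std::vector<float>& reference,
                       const std::vector<float>& candidate) {
  float worst = 0.0f;
  for (std::size_t i = 0; i < reference.size(); ++i) {
    if (reference[i] != 0.0f) {
      worst = std::max(worst, std::fabs(reference[i] - candidate[i]) /
                                  std::fabs(reference[i]));
    }
  }
  return worst;
}
```

Under these assumptions, each value on the CUDA branch differs from its fp32 original by a relative error of at most 2^-11, which is why a comparison done in `float` after converting back needs a tolerance loose enough for fp16 rounding.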
### Testing
- `onnxruntime_provider_test --gtest_filter='GroupQueryAttentionTest.BatchedRightPaddedRotaryPrefill_CUDA'`
→ **PASSED** (now runs on the CUDA fp16 kernel).
- Full `GroupQueryAttentionTest.*` suite → 47 passed, WebGPU-only tests
skipped locally (no WebGPU EP), no regressions.
### Motivation and Context
Restores genuine CUDA kernel coverage for the right-padded rotary
prefill scenario and fixes the CI failure. Related: #29002, #29046.