Fix GroupQueryAttention right-padded rotary prefill CUDA test by tianleiwu · Pull Request #29218 · microsoft/onnxruntime

tianleiwu · 2026-06-22T21:35:49Z

Description

The GroupQueryAttentionTest.BatchedRightPaddedRotaryPrefill_CUDA test (added in #29002) fed fp32 inputs via AddInput<float>. The CUDA (and WebGPU) GroupQueryAttention kernels only register for MLFloat16/BFloat16, so the fp32 node silently fell back to the CPU EP — the _CUDA test never actually exercised the CUDA kernel it is named for. This surfaced as a CI failure on the CUDA test leg after #29002 and #29046 merged.
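The silent fallback can be illustrated with a toy placement model in standalone C++. This is not ONNX Runtime's partitioning code. ElemType, EpRegistration, AssignEp, and GqaRegistrationsFromPr are hypothetical names, and real EP assignment (capability queries, kernel lookup by type constraints) is more involved. The CPU entry lists only fp32 because that is all the PR text implies. The actual CPU kernel may register other types too.

```cpp
#include <cassert>
#include <optional>
#include <set>
#include <string>
#include <vector>

// Hypothetical element-type tags for this sketch; not ONNX Runtime's type system.
enum class ElemType { Float32, Float16, BFloat16 };

// Hypothetical stand-in for one EP's kernel registrations for a single op.
struct EpRegistration {
  std::string ep_name;
  std::set<ElemType> supported_types;
};

// Toy placement rule: walk the EPs in priority order and return the first one
// that registered a kernel for the node's input element type.
std::optional<std::string> AssignEp(const std::vector<EpRegistration>& eps_by_priority,
                                    ElemType input_type) {
  assert(!eps_by_priority.empty() && "need at least one EP");
  for (const auto& ep : eps_by_priority) {
    if (ep.supported_types.count(input_type) > 0) {
      return ep.ep_name;
    }
  }
  return std::nullopt;  // no EP can run this node
}

// Registrations as the PR describes them: the CUDA GroupQueryAttention kernel
// accepts only fp16/bf16, and the CPU kernel accepts fp32. CPU is listed last
// so it acts as the fallback.
std::vector<EpRegistration> GqaRegistrationsFromPr() {
  return {
      {"CUDAExecutionProvider", {ElemType::Float16, ElemType::BFloat16}},
      {"CPUExecutionProvider", {ElemType::Float32}},
  };
}
```

With these registrations an fp32 node is assigned to CPUExecutionProvider and an fp16 node to CUDAExecutionProvider. Nothing in the placement step reports the switch, which is why a test named _CUDA can pass without touching the GPU kernel.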

This PR makes RunGQAPackedQKVRotaryPrefill feed fp16 tensors when targeting the CUDA EP, matching the existing RunGQASharedKVFp16 convention and the test's own "loose enough for fp16 rounding" tolerance. The CPU code path is unchanged.

Key Changes

- RunGQAPackedQKVRotaryPrefill now branches on the target EP:
  - CUDA EP: inputs/outputs use MLFloat16 (converted via ToFloat16), so the node is placed on the real GPU kernel.
  - WebGPU/CPU EP: unchanged (float).
- Output is converted back to float for the existing comparison logic.
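The precision effect of that branch can be sketched in standalone C++. This is not the test's code. TargetEp, RoundToFp16Precision, InputsAsSeenByKernel, and AllClose are hypothetical helpers. Instead of ORT's MLFloat16/ToFloat16, the sketch emulates an fp16 round trip by rounding the fp32 mantissa from 23 to 10 bits with round-to-nearest-even. That emulation is only accurate for finite values in fp16's normal range.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical EP selector for this sketch (not an ONNX Runtime type).
enum class TargetEp { Cpu, WebGpu, Cuda };

// Emulates storing an fp32 value as fp16 and reading it back: round the 23-bit
// fp32 mantissa to fp16's 10 bits with round-to-nearest-even. Only accurate for
// finite values in fp16's normal range (about 6.1e-5 to 65504); a real fp16
// conversion also handles subnormals, overflow to inf, and NaN.
float RoundToFp16Precision(float x) {
  std::uint32_t bits;
  std::memcpy(&bits, &x, sizeof bits);
  constexpr std::uint32_t kDroppedBits = 23 - 10;
  const std::uint32_t kept_lsb = (bits >> kDroppedBits) & 1u;
  // Adding (half - 1 + lsb) and then truncating implements ties-to-even.
  bits += (1u << (kDroppedBits - 1)) - 1u + kept_lsb;
  bits &= ~((1u << kDroppedBits) - 1u);
  float y;
  std::memcpy(&y, &bits, sizeof y);
  return y;
}

// Mirrors the branch in the PR description: only the CUDA path changes
// precision; the CPU and WebGPU paths keep the float data as-is.
std::vector<float> InputsAsSeenByKernel(TargetEp ep, const std::vector<float>& values) {
  if (ep != TargetEp::Cuda) {
    return values;
  }
  std::vector<float> rounded;
  rounded.reserve(values.size());
  for (float v : values) {
    rounded.push_back(RoundToFp16Precision(v));
  }
  return rounded;
}

// Elementwise |actual - expected| <= atol + rtol * |expected| check, applied
// after the output has been converted back to float.
bool AllClose(const std::vector<float>& expected, const std::vector<float>& actual,
              float atol, float rtol) {
  assert(expected.size() == actual.size() && "shape mismatch");
  for (std::size_t i = 0; i < expected.size(); ++i) {
    if (std::fabs(actual[i] - expected[i]) > atol + rtol * std::fabs(expected[i])) {
      return false;
    }
  }
  return true;
}
```

For 0.1f the CUDA path yields 0.0999755859375f, a relative error of roughly 2.4e-4 from a single rounding. A tolerance that is exact for fp32 would fail on such values, so the fp16 path only makes sense together with an fp16-sized tolerance. Accumulation inside the real kernel adds further error on top of this input rounding.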

Testing

- onnxruntime_provider_test --gtest_filter='GroupQueryAttentionTest.BatchedRightPaddedRotaryPrefill_CUDA' → PASSED (now runs on the CUDA fp16 kernel).
- Full GroupQueryAttentionTest.* suite → 47 passed, WebGPU-only tests skipped locally (no WebGPU EP), no regressions.

Motivation and Context

Restores genuine CUDA kernel coverage for the right-padded rotary prefill scenario and fixes the CI failure. Related: #29002, #29046.


hariharans29 · 2026-06-22T21:55:23Z

FYI @qjia7


hariharans29 reviewed Jun 22, 2026


Review comment on onnxruntime/test/contrib_ops/group_query_attention_op_test.cc (outdated)

Commit 0b77086: update

hariharans29 approved these changes Jun 22, 2026


tianleiwu enabled auto-merge (squash) June 22, 2026 21:58

tianleiwu merged commit 14a6c9e into microsoft:main Jun 23, 2026
96 of 98 checks passed

tianleiwu merged 2 commits into microsoft:main from tianleiwu:tlwu/fix_cuda_ci_gqa_test_failure