Fix GroupQueryAttention right-padded rotary prefill CUDA test #29218

Merged
tianleiwu merged 2 commits into
microsoft:main from
tianleiwu:tlwu/fix_cuda_ci_gqa_test_failure
Jun 23, 2026
tianleiwu commented Jun 22, 2026
Contributor

Description

The GroupQueryAttentionTest.BatchedRightPaddedRotaryPrefill_CUDA test (added in #29002) fed fp32 inputs via AddInput<float>. The CUDA (and WebGPU) GroupQueryAttention kernels only register for MLFloat16/BFloat16, so the fp32 node silently fell back to the CPU EP — the _CUDA test never actually exercised the CUDA kernel it is named for. This surfaced as a CI failure on the CUDA test leg after #29002 and #29046 merged.
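The fallback described above can be pictured with a toy model of type-constrained kernel lookup. This is only a sketch: `ElemType`, `KernelRegistry`, `AssignProvider`, and `MakeToyGqaRegistry` are hypothetical names invented for illustration, not ONNX Runtime APIs, and the registry contents are assumptions drawn from this description rather than from the real kernel registrations.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy model only: these types and functions are hypothetical, not ONNX Runtime APIs.
enum class ElemType { kFloat, kFloat16, kBFloat16 };

// Provider name -> element types it has a (toy) GroupQueryAttention kernel for.
using KernelRegistry = std::map<std::string, std::set<ElemType>>;

// Walks the providers in priority order and returns the first one that has a
// kernel registered for `type`, or "unassigned" if none does.
std::string AssignProvider(const KernelRegistry& registry,
                           const std::vector<std::string>& priority,
                           ElemType type) {
  for (const std::string& provider : priority) {
    const auto it = registry.find(provider);
    if (it != registry.end() && it->second.count(type) > 0) {
      return provider;
    }
  }
  return "unassigned";
}

// Registry contents are illustrative assumptions based on the PR description.
KernelRegistry MakeToyGqaRegistry() {
  return {
      {"CUDA", {ElemType::kFloat16, ElemType::kBFloat16}},
      {"CPU", {ElemType::kFloat}},
  };
}
```

With a priority list of `{"CUDA", "CPU"}`, looking up `ElemType::kFloat` in this toy registry resolves to `"CPU"`, while `ElemType::kFloat16` resolves to `"CUDA"`. That mirrors why a test named `_CUDA` could run without touching the CUDA kernel when it fed fp32 tensors.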

This PR makes RunGQAPackedQKVRotaryPrefill feed fp16 tensors when targeting CUDA EP, matching the existing RunGQASharedKVFp16 convention and the test's own "loose enough for fp16 rounding" tolerance. The CPU code path is unchanged.

Key Changes

  • RunGQAPackedQKVRotaryPrefill now branches on the target EP:
    • CUDA EP: inputs/outputs use MLFloat16 (converted via ToFloat16), so the node is placed on the real GPU kernel.
    • WebGPU/CPU EP: unchanged (float).
  • Output is converted back to float for the existing comparison logic.
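To illustrate why feeding fp16 and converting the output back to float calls for a looser tolerance, here is a minimal standalone sketch of an fp32 to fp16 round trip. It does not use the test's `ToFloat16` helper or ONNX Runtime's `MLFloat16` type; `FloatToHalfBits`, `HalfBitsToFloat`, and `MaxRoundTripError` are hypothetical helpers written for this example, assuming IEEE 754 binary16 with round-to-nearest-even.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helpers for illustration; ONNX Runtime ships its own MLFloat16 type.

// Converts a float to IEEE 754 binary16 bits with round-to-nearest-even.
uint16_t FloatToHalfBits(float value) {
  uint32_t bits;
  std::memcpy(&bits, &value, sizeof(bits));
  const uint32_t sign = (bits >> 16) & 0x8000u;
  uint32_t abs_bits = bits & 0x7FFFFFFFu;
  if (abs_bits > 0x7F800000u) return static_cast<uint16_t>(sign | 0x7E00u);   // NaN
  if (abs_bits >= 0x477FF000u) return static_cast<uint16_t>(sign | 0x7C00u);  // Inf or overflow
  if (abs_bits < 0x38800000u) {
    // Below the smallest normal half (2^-14): adding 0.5f leaves the value,
    // rounded to a multiple of 2^-24, in the float's low mantissa bits.
    float magnitude;
    std::memcpy(&magnitude, &abs_bits, sizeof(magnitude));
    magnitude += 0.5f;
    std::memcpy(&abs_bits, &magnitude, sizeof(abs_bits));
    return static_cast<uint16_t>(sign | (abs_bits - 0x3F000000u));
  }
  // Normal range: round away the 13 dropped mantissa bits, then rebias the exponent.
  const uint32_t rounded = abs_bits + 0x0FFFu + ((abs_bits >> 13) & 1u);
  return static_cast<uint16_t>(sign | ((rounded - 0x38000000u) >> 13));
}

// Converts IEEE 754 binary16 bits back to a float (always exact).
float HalfBitsToFloat(uint16_t half) {
  const uint32_t sign = (static_cast<uint32_t>(half) & 0x8000u) << 16;
  const uint32_t exponent = (half >> 10) & 0x1Fu;
  const uint32_t mantissa = half & 0x3FFu;
  uint32_t bits;
  if (exponent == 0) {
    // Zero or subnormal: the value is mantissa * 2^-24.
    const float magnitude = std::ldexp(static_cast<float>(mantissa), -24);
    std::memcpy(&bits, &magnitude, sizeof(bits));
    bits |= sign;
  } else if (exponent == 31) {
    bits = sign | 0x7F800000u | (mantissa << 13);  // Inf or NaN
  } else {
    bits = sign | ((exponent + 112u) << 23) | (mantissa << 13);
  }
  float result;
  std::memcpy(&result, &bits, sizeof(result));
  return result;
}

// Rounds every element through fp16 and back, returning the largest absolute error.
float MaxRoundTripError(const std::vector<float>& values) {
  float max_error = 0.0f;
  for (const float v : values) {
    const float back = HalfBitsToFloat(FloatToHalfBits(v));
    max_error = std::max(max_error, std::fabs(back - v));
  }
  return max_error;
}
```

Because fp16 keeps only 10 explicit mantissa bits, a value near 3 can move by up to roughly 1e-3 in the round trip alone, before any kernel arithmetic. A comparison against an fp32 reference therefore needs a tolerance on that order or larger, which is consistent with the test's "loose enough for fp16 rounding" wording.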

Testing

  • onnxruntime_provider_test --gtest_filter='GroupQueryAttentionTest.BatchedRightPaddedRotaryPrefill_CUDA' → PASSED (now runs on the CUDA fp16 kernel).
  • Full GroupQueryAttentionTest.* suite → 47 passed, WebGPU-only tests skipped locally (no WebGPU EP), no regressions.

Motivation and Context

Restores genuine CUDA kernel coverage for the right-padded rotary prefill scenario and fixes the CI failure. Related: #29002, #29046.

BatchedRightPaddedRotaryPrefill_CUDA fed fp32 inputs via AddInput<float>. The
CUDA/WebGPU GroupQueryAttention kernels only register for MLFloat16/BFloat16, so
the fp32 node silently fell back to the CPU EP and the _CUDA test never exercised
the CUDA kernel it is named for.

Make RunGQAPackedQKVRotaryPrefill feed fp16 tensors when targeting a GPU EP
(matching the existing RunGQASharedKVFp16 convention and the test's own fp16
tolerance), so the test runs on the actual CUDA kernel. The CPU path is
unchanged. Verified the CUDA fp16 path passes the right-padded prefill.
Comment thread onnxruntime/test/contrib_ops/group_query_attention_op_test.cc Outdated
@hariharans29
Member

FYI @qjia7

tianleiwu enabled auto-merge (squash) June 22, 2026 21:58
tianleiwu merged commit 14a6c9e into microsoft:main Jun 23, 2026
96 of 98 checks passed
tianleiwu added a commit that referenced this pull request Jun 23, 2026
2 participants