ONNX Attention CUDA: Coverage Gaps in Runner Fallback Paths #27880

@titaiwangms

ONNX Attention Op — CUDA Implementation Gap Tracking

Parent issue: #27516

Status: All Gaps Closed ✅

All gaps are addressed by #28198 (merged) and #27992 (in review).

Dispatch Cascade (Final)

Flash Attention → MEA (CUTLASS) → Unified Unfused Attention

The Legacy MHA Unfused path (QkvToContext from attention_impl.cu) has been eliminated from the ONNX Attention op. The unified kernel handles both MHA and GQA.
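The cascade above can be read as "try each runner in priority order, fall through to the next when the current one cannot handle the configuration". A minimal Python sketch of that pattern follows. It is illustrative only, not the ONNX Runtime code: `AttentionConfig`, `select_runner`, and every eligibility predicate here are hypothetical names and placeholder rules chosen for the example, and the real checks in the CUDA kernels may differ.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class AttentionConfig:
    """Hypothetical description of one Attention call (not an ORT type)."""
    dtype: str                  # "fp16", "bf16" or "fp32"
    num_heads: int
    kv_num_heads: int
    has_attn_mask: bool = False
    softcap: float = 0.0        # 0.0 means softcap is disabled


# Placeholder predicates: ASSUMPTIONS for illustration, not the real rules.
def flash_eligible(cfg: AttentionConfig) -> bool:
    return cfg.dtype in ("fp16", "bf16") and not cfg.has_attn_mask


def mea_eligible(cfg: AttentionConfig) -> bool:
    return not (cfg.softcap > 0.0 and cfg.has_attn_mask)


def unified_unfused_eligible(cfg: AttentionConfig) -> bool:
    # Last resort: in this sketch it accepts every configuration.
    return True


# Priority order mirrors the cascade: Flash, then MEA, then unified unfused.
CASCADE: List[Tuple[str, Callable[[AttentionConfig], bool]]] = [
    ("flash", flash_eligible),
    ("mea", mea_eligible),
    ("unified_unfused", unified_unfused_eligible),
]


def select_runner(cfg: AttentionConfig) -> str:
    """Return the name of the first runner whose predicate accepts cfg."""
    for name, is_eligible in CASCADE:
        if is_eligible(cfg):
            return name
    raise RuntimeError("no attention runner accepts this configuration")
```

Keeping the fallback as the final entry of an ordered list makes coverage gaps easy to test: any configuration that reaches the end of the list without a match raises instead of silently taking an unintended path.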

Support Matrix (Unified Unfused Attention)

Feature Flash MEA Unified Unfused
MHA (num_heads == kv_num_heads)
GQA (num_heads > kv_num_heads)
fp16 / bf16
fp32 ❌ (GQA)
Softcap
Padding mask (seqlens_k)
Explicit attn_mask
Softcap + mask ❌ (no mask)
Past KV
Past KV + H≠H_v
Causal
output_qk (mode 0)
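Two rows of the matrix are easier to follow with a concrete reference: how query heads map to key/value heads under MHA versus GQA, and what softcap does to a score. The sketch below is plain Python under stated assumptions: the helper names `kv_head_for` and `apply_softcap` are hypothetical, GQA is assumed to require `num_heads` to be a multiple of `kv_num_heads`, and softcap is assumed to be the usual `softcap * tanh(score / softcap)` form with `0` meaning disabled. It does not show where softcap sits relative to masking.

```python
import math


def kv_head_for(q_head: int, num_heads: int, kv_num_heads: int) -> int:
    """Map a query head index to the KV head it reads from.

    MHA is the special case num_heads == kv_num_heads (identity mapping).
    GQA shares each KV head across a group of consecutive query heads.
    """
    if kv_num_heads <= 0 or num_heads % kv_num_heads != 0:
        raise ValueError("num_heads must be a positive multiple of kv_num_heads")
    if not 0 <= q_head < num_heads:
        raise ValueError("q_head out of range")
    group_size = num_heads // kv_num_heads
    return q_head // group_size


def apply_softcap(score: float, softcap: float) -> float:
    """Squash a scaled QK score into (-softcap, softcap); no-op if disabled."""
    if softcap <= 0.0:
        return score
    return softcap * math.tanh(score / softcap)
```

For example, with `num_heads=8` and `kv_num_heads=2`, query heads 0 to 3 read KV head 0 and query heads 4 to 7 read KV head 1.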

Previously Open Gaps — Now Closed

PRs

ONNX Spec

Remaining Follow-ups (not blocking)

This issue will be closed when #27992 merges.

Label: ep:CUDA (issues related to the CUDA execution provider)