ONNX Attention Op — CUDA Implementation Gap Tracking
Parent issue: #27516
Status: All Gaps Closed ✅
All gaps addressed by #28198 (merged) and #27992 (in review).
Dispatch Cascade (Final)
Flash Attention → MEA (CUTLASS) → Unified Unfused Attention
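The cascade tries each tier in order and falls through to the next tier when the current one cannot serve the request. The sketch below is a minimal illustration, not onnxruntime's dispatcher. The struct and flags (`AttentionRequest`, `has_flash`, `has_mea`) are hypothetical, and so are the eligibility rules. It assumes that neither fused kernel materializes the QK matrix, so `output_qk` rules both of them out. It also assumes, for illustration only, that the Flash tier requires H == H_v.

```python
from dataclasses import dataclass


@dataclass
class AttentionRequest:
    # Hypothetical dispatcher inputs; not onnxruntime's real parameter struct.
    head_size: int    # H (Q/K head size)
    v_head_size: int  # H_v (V head size)
    output_qk: bool   # caller requested the QK scores as an extra output
    has_flash: bool   # assumed: a Flash Attention kernel is available
    has_mea: bool     # assumed: the CUTLASS memory-efficient kernel is available


def flash_ok(r: AttentionRequest) -> bool:
    # Assumption: fused kernels never materialize QK, and this sketch's
    # Flash tier only handles H == H_v.
    return r.has_flash and not r.output_qk and r.head_size == r.v_head_size


def mea_ok(r: AttentionRequest) -> bool:
    # Assumption: MEA also cannot expose the QK scores.
    return r.has_mea and not r.output_qk


def select_kernel(r: AttentionRequest) -> str:
    # Tier 1 -> tier 2 -> tier 3. The unified unfused kernel is the catch-all,
    # so there is no fourth (legacy) fallback.
    if flash_ok(r):
        return "flash"
    if mea_ok(r):
        return "mea_cutlass"
    return "unified_unfused"
```

In this sketch, a request with `output_qk` set skips both fused tiers and lands on the unified kernel. Under these assumptions, that is where QK-output support has to live once the Legacy path is gone.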
The Legacy MHA Unfused path (QkvToContext from attention_impl.cu) has been eliminated from the ONNX Attention op. The unified kernel handles both MHA and GQA.
Support Matrix (Unified Unfused Attention)
Previously Open Gaps — Now Closed
output_qk: Only supported in the Legacy path → Added ScaledCopyQkKernel to the unified kernel (#27992)
H≠H_v + past KV: Not supported in GQA Unfused → Separate K/V concat calls (#27992)
MHA in unfused path: Required the Legacy wrapper → Unified kernel handles MHA (group_size=1) (#27992)
PRs
#28198 (merged)
#27992: Eliminate Legacy MHA Unfused path from ONNX Attention; unify on 3-tier dispatch with causal alignment fix (in review)
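The three gap closures can be illustrated with a pure-Python reference for a single head. This is a sketch, not the CUDA code in #27992, and the function names (`kv_head_for`, `concat_past`, `scaled_qk`, `attend_head`) are hypothetical. It makes two assumptions: query heads are mapped to KV heads using the usual GQA convention that consecutive query heads share one KV head, and the values exposed through output_qk are the scaled scores scale·QKᵀ taken before softmax.

```python
import math


def kv_head_for(q_head: int, num_heads: int, kv_num_heads: int) -> int:
    # Consecutive query heads share one KV head; MHA is group_size == 1.
    assert num_heads % kv_num_heads == 0
    group_size = num_heads // kv_num_heads
    return q_head // group_size


def concat_past(past, new):
    # past/new: [kv_heads][seq][head_dim]; append new tokens along seq.
    # Called separately for K (head_dim = H) and V (head_dim = H_v), so the
    # two caches never need to share a row width.
    return [p + n for p, n in zip(past, new)]


def scaled_qk(q, k, scale):
    # q: [S_q][H], k: [S_kv][H] -> scale * Q @ K^T, shape [S_q][S_kv].
    return [[scale * sum(a * b for a, b in zip(qi, kj)) for kj in k] for qi in q]


def attend_head(q, k, v, scale):
    # Unfused attention for one query head. Returns the output plus the
    # scaled scores that an output_qk-style copy would write out.
    scores = scaled_qk(q, k, scale)
    out = []
    for row in scores:
        m = max(row)  # subtract the row max for a numerically stable softmax
        w = [math.exp(s - m) for s in row]
        z = sum(w)
        out.append([sum(wj * vj[d] for wj, vj in zip(w, v)) / z
                    for d in range(len(v[0]))])
    return out, scores
```

Take num_heads=4 and kv_num_heads=2 as an example. Query heads 0 and 1 read KV head 0, and heads 2 and 3 read KV head 1. When kv_num_heads equals num_heads, the mapping is the identity, which is how one kernel can cover MHA.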
ONNX Spec
Remaining Follow-ups (not blocking)
This issue will be closed when #27992 merges.