ONNX Attention CUDA: Coverage Gaps in Runner Fallback Paths

## ONNX Attention Op — CUDA Implementation Gap Tracking

Parent issue: #27516

### Status: All Gaps Closed ✅

All gaps are addressed by #28198 (merged) and #27992 (in review).

### Dispatch Cascade (Final)

```
Flash Attention → MEA (CUTLASS) → Unified Unfused Attention
```

The Legacy MHA Unfused path (`QkvToContext` from `attention_impl.cu`) has been **eliminated** from the ONNX Attention op. The unified kernel handles both MHA and GQA.

### Support Matrix (Unified Unfused Attention)

| Feature | Flash | MEA | Unified Unfused |
|---------|-------|-----|-----------------|
| MHA (num_heads == kv_num_heads) | ✅ | ✅ | ✅ |
| GQA (num_heads > kv_num_heads) | ✅ | ✅ | ✅ |
| fp16 / bf16 | ✅ | ✅ | ✅ |
| fp32 | ❌ | ❌ (GQA) | ✅ |
| Softcap | ✅ | ✅ | ✅ |
| Padding mask (seqlens_k) | ✅ | ✅ | ✅ |
| Explicit attn_mask | ❌ | ✅ | ✅ |
| Softcap + mask | ❌ (no mask) | ✅ | ✅ |
| Past KV | ✅ | ✅ | ✅ |
| Past KV + H≠H_v | ❌ | ❌ | ✅ |
| Causal | ✅ | ✅ | ✅ |
| output_qk (mode 0) | ❌ | ❌ | ✅ |
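
The matrix above can be read as a selection rule: try each path in cascade order and take the first one that supports the request. Below is a minimal sketch of that reading, not the actual dispatch code. The `AttnRequest` type, its field names, and the helper functions are hypothetical, and the real kernels likely check further conditions (head size, GPU architecture, and so on) that this table does not list. The `❌ (GQA)` cell is interpreted here as "MEA rejects fp32 only when the call is GQA".

```python
from dataclasses import dataclass


@dataclass
class AttnRequest:
    """Hypothetical summary of one Attention call; names are illustrative only."""
    dtype: str = "fp16"            # "fp16", "bf16", or "fp32"
    is_gqa: bool = False           # num_heads > kv_num_heads
    has_attn_mask: bool = False    # explicit attn_mask input present
    has_past_kv: bool = False      # past key/value inputs present
    v_head_differs: bool = False   # H != H_v
    wants_output_qk: bool = False  # output_qk (mode 0) requested


def flash_ok(r: AttnRequest) -> bool:
    # Flash rows marked unsupported: fp32, explicit attn_mask,
    # past KV with H != H_v, and output_qk.
    return (
        r.dtype in ("fp16", "bf16")
        and not r.has_attn_mask
        and not (r.has_past_kv and r.v_head_differs)
        and not r.wants_output_qk
    )


def mea_ok(r: AttnRequest) -> bool:
    # MEA rows marked unsupported: fp32 for GQA (assumed reading of the cell),
    # past KV with H != H_v, and output_qk.
    return (
        not (r.dtype == "fp32" and r.is_gqa)
        and not (r.has_past_kv and r.v_head_differs)
        and not r.wants_output_qk
    )


def select_path(r: AttnRequest) -> str:
    """Walk the cascade: Flash, then MEA, then the unified unfused kernel."""
    if flash_ok(r):
        return "flash"
    if mea_ok(r):
        return "mea"
    return "unified_unfused"  # supports every row in the table
```

For example, under these assumptions a request with an explicit `attn_mask` skips Flash and lands on MEA, while any request for `output_qk` falls through to the unified unfused kernel.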

### Previously Open Gaps — Now Closed

- ~~**output_qk**: Only supported in Legacy path~~ → Added `ScaledCopyQkKernel` to unified kernel (#27992)
- ~~**H≠H_v + past KV**: Not supported in GQA Unfused~~ → Separate K/V concat calls (#27992)
- ~~**MHA in unfused path**: Required Legacy wrapper~~ → Unified kernel handles MHA (group_size=1) (#27992)
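
To illustrate why a GQA kernel can cover MHA with `group_size=1`, here is a small sketch of the usual query-head to KV-head mapping. It assumes the common convention that consecutive query heads share one KV head. The function name is hypothetical and is not taken from the kernel source.

```python
def kv_head_for(q_head: int, num_heads: int, kv_num_heads: int) -> int:
    """Return the KV head index that a given query head reads from."""
    if num_heads % kv_num_heads != 0:
        raise ValueError("num_heads must be a multiple of kv_num_heads")
    group_size = num_heads // kv_num_heads  # query heads per KV head
    return q_head // group_size


# GQA: 8 query heads share 2 KV heads (group_size = 4).
gqa_map = [kv_head_for(h, 8, 2) for h in range(8)]

# MHA: group_size = 1, so the mapping is the identity.
mha_map = [kv_head_for(h, 8, 8) for h in range(8)]
```

With `group_size=1` every query head maps to its own KV head, so the same indexing logic serves both cases without a separate MHA wrapper.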

### PRs

- #28198 — GQA unfused attention with FP32 QK accumulation ✅ Merged
- #27992 — Fix softcap ordering, unify unfused kernel, eliminate Legacy path 🔄 In Review

### ONNX Spec

- Spec bug: onnx/onnx#7865 (softcap ordering)
- Spec fix: onnx/onnx#7867
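
This issue does not spell out which ordering the spec fix settles on, so the sketch below only illustrates why the order of softcap and masking matters at all. It assumes the common softcap definition `cap * tanh(x / cap)` and an additive mask that uses `-inf` for masked positions. All function names are illustrative.

```python
import math


def softcap(x: float, cap: float) -> float:
    # Squashes any input into the range [-cap, cap].
    return cap * math.tanh(x / cap)


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def mask_then_softcap(scores, mask, cap):
    # Mask is added first, so softcap turns -inf into -cap (a finite score).
    return softmax([softcap(s + b, cap) for s, b in zip(scores, mask)])


def softcap_then_mask(scores, mask, cap):
    # Softcap is applied first, so the -inf mask value survives to softmax.
    return softmax([softcap(s, cap) + b for s, b in zip(scores, mask)])


scores = [1.0, 2.0, 3.0]
mask = [0.0, 0.0, float("-inf")]  # third position is masked out
cap = 5.0

leaky = mask_then_softcap(scores, mask, cap)
strict = softcap_then_mask(scores, mask, cap)
```

In this example `leaky[2]` is greater than zero because `tanh` maps `-inf` to `-1`, so the masked position still receives attention weight, whereas `strict[2]` is exactly zero.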

### Remaining Follow-ups (not blocking)

- #28215 — Causal + softcap interaction (spec divergence)

This issue will be closed when #27992 merges.

ONNX Attention CUDA: Coverage Gaps in Runner Fallback Paths #27880
