[Feature] VisualGen: Add qknorm + rope fuse kernel for cross-head norm (Wan/LTX-2) #12716

@schetlur-nv

Description

🚀 The feature, motivation and pitch

For visual gen attention blocks, all current models have adopted the qk-norm attention path, i.e.,
 qkv_proj → qk_rmsnorm → qk_rope → attn (BMMs)
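
To make the target concrete, below is a minimal unfused PyTorch sketch of that path. The shapes, weight layout, and the per-head/interleaved choices are illustrative only (this is not the TRT-LLM API); it just shows the four stages the fused kernel would cover.

```python
import torch
import torch.nn.functional as F

def rmsnorm(x, weight, eps=1e-6):
    # RMS-normalize over the last dim; with x shaped [..., head_dim] this is the per-head variant.
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps) * weight

def rope_interleaved(x, cos, sin):
    # Interleaved RoPE: rotate adjacent even/odd pairs within each head.
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1).flatten(-2)

def unfused_qknorm_rope_attn(x, w_qkv, q_weight, k_weight, cos, sin, num_heads):
    b, s, hidden = x.shape
    head_dim = hidden // num_heads
    # qkv_proj
    q, k, v = (x @ w_qkv).chunk(3, dim=-1)
    q, k, v = (t.view(b, s, num_heads, head_dim).transpose(1, 2) for t in (q, k, v))
    # qk_rmsnorm (per-head shown here)
    q, k = rmsnorm(q, q_weight), rmsnorm(k, k_weight)
    # qk_rope (interleaved shown here)
    q, k = rope_interleaved(q, cos, sin), rope_interleaved(k, cos, sin)
    # attn: the QK^T and PV BMMs inside SDPA
    return F.scaled_dot_product_attention(q, k, v)
```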

Unlike the LLM qk-norm attention module, visual gen currently does not have a unified fused_qk_norm_rope path, because of the many variants: qk_rmsnorm can be per-head norm or cross-head norm, RoPE can be interleaved or split-half, and q/k can have different sequence lengths (cross-attn vs. self-attn). Two of these variants are sketched below.
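
A rough sketch of how two of those variants differ, assuming "cross-head norm" means the RMS statistics and weight span the concatenated heads (num_heads * head_dim) rather than a single head_dim slice; names and shapes here are hypothetical:

```python
import torch

def rms(x, weight, eps=1e-6):
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps) * weight

b, s, num_heads, head_dim = 2, 16, 8, 64
q = torch.randn(b, s, num_heads * head_dim)

# Per-head norm (Flux-style): split into heads first, so RMS statistics and the
# learned weight cover a single head_dim slice.
q_per_head = rms(q.view(b, s, num_heads, head_dim), torch.ones(head_dim))

# Cross-head norm (the Wan/LTX2 style referenced above, as I read it): normalize
# the concatenated q of width num_heads * head_dim before the head split, so the
# statistics and weight span all heads.
q_cross_head = rms(q, torch.ones(num_heads * head_dim)).view(b, s, num_heads, head_dim)

def rope_split_half(x, cos, sin):
    # Split-half RoPE: rotate the first half of head_dim against the second half,
    # instead of adjacent even/odd pairs as in the interleaved variant.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
```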

So far, the fused qk_norm_rope kernel for Flux (per-head, interleaved, self-attn path) shows a ~5–8% perf gain.
It would be great to extend this fused path to the Wan / LTX2 models. In particular, qk-norm in Wan/LTX2 uses cross-head norm kernels, so this is also an opportunity to further optimize cross-head norm kernel performance (especially for LTX2, where the norm takes 5–10% of runtime).

References:

  • Flux fused qknorm+rope: PR #11869
  • Similar effort in SGLang: sgl-project/sglang#21503 / sgl-project/sglang#21440

Alternatives

No response

Additional context

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.
