
Add opt-in Exclusive Self Attention support #13710

Open

taivu1998 wants to merge 1 commit into huggingface:main from taivu1998:tdv/issue-13447-xsa

Conversation

@taivu1998

Summary

Adds opt-in support for Exclusive Self Attention (XSA) in the shared Diffusers attention stack.

  • Adds exclusive_self_attention=False to Attention.
  • Applies the XSA projection after the attention output and before the output projection, only on true self-attention calls.
  • Wires the option through BasicTransformerBlock (a constructor sketch follows below the summary).
  • Exposes the option through Transformer2DModel, DiTTransformer2DModel, and PixArtTransformer2DModel configs.
  • Adds focused tests for the XSA math, self-vs-cross attention gating, processor variants, and model config/forward propagation.

Fixes #13447.
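
As a quick illustration of the Attention and BasicTransformerBlock wiring above, a minimal constructor sketch; the dimensions are arbitrary, and only the exclusive_self_attention argument comes from this PR:

from diffusers.models.attention import BasicTransformerBlock

# Hedged sketch: dim/head sizes are arbitrary; exclusive_self_attention is the
# opt-in flag added by this PR and defaults to False.
block = BasicTransformerBlock(
    dim=64,
    num_attention_heads=4,
    attention_head_dim=16,
    exclusive_self_attention=True,
)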

Motivation

Issue #13447 requests an optional Exclusive Self Attention mode based on:

z_i = y_i - (y_i @ v_i) / (v_i @ v_i) * v_i

This PR implements the equivalent normalized projection form:

value_normalized = F.normalize(value, p=2, dim=-1)
hidden_states = hidden_states - (hidden_states * value_normalized).sum(dim=-1, keepdim=True) * value_normalized
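
The two forms are algebraically identical; a small sanity check on random tensors (not part of the PR's test suite):

import torch
import torch.nn.functional as F

y = torch.randn(2, 8, 16)  # attention output per token
v = torch.randn(2, 8, 16)  # corresponding value vectors

# Issue formulation: subtract the projection of y_i onto v_i.
z_issue = y - (y * v).sum(dim=-1, keepdim=True) / (v * v).sum(dim=-1, keepdim=True) * v

# PR formulation: normalize v first, then subtract the projection.
v_hat = F.normalize(v, p=2, dim=-1)
z_pr = y - (y * v_hat).sum(dim=-1, keepdim=True) * v_hat

assert torch.allclose(z_issue, z_pr, atol=1e-5)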

The flag is stored on the Attention module rather than on processor instances, so it remains stable when attention processors are swapped.
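
Because the flag lives on the module, a sketch like the following (assuming this PR's constructor argument) keeps working across processor swaps:

from diffusers.models.attention_processor import Attention, AttnProcessor, AttnProcessor2_0

# Hedged sketch: sizes are arbitrary; exclusive_self_attention is the new module-level flag.
attn = Attention(query_dim=64, heads=4, dim_head=16, exclusive_self_attention=True)

attn.set_processor(AttnProcessor())     # swap to the eager processor
assert attn.exclusive_self_attention    # flag is untouched by the swap

attn.set_processor(AttnProcessor2_0())  # swap to the SDPA processor
assert attn.exclusive_self_attention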

Scope

Supported in this PR:

  • AttnProcessor
  • AttnProcessor2_0
  • FusedAttnProcessor2_0
  • SlicedAttnProcessor
  • XFormersAttnProcessor
  • AttnProcessorNPU
  • XLAFlashAttnProcessor2_0

The projection is applied only when the current call is true self-attention (encoder_hidden_states is None). Cross-attention remains unchanged even if sequence lengths happen to match.
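
A hedged sketch of that gating; the helper name _apply_xsa is illustrative, not the PR's actual code:

import torch
import torch.nn.functional as F

def _apply_xsa(hidden_states: torch.Tensor, value: torch.Tensor) -> torch.Tensor:
    # Remove, per token, the component of the attention output that lies along
    # the normalized value vector.
    value_normalized = F.normalize(value, p=2, dim=-1)
    projection = (hidden_states * value_normalized).sum(dim=-1, keepdim=True) * value_normalized
    return hidden_states - projection

# Inside a processor's __call__, roughly:
#     if attn.exclusive_self_attention and encoder_hidden_states is None:
#         hidden_states = _apply_xsa(hidden_states, value)
#     hidden_states = attn.to_out[0](hidden_states)  # output projection follows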

Model-level exposure is included for the following models (usage sketch after the list):

  • Transformer2DModel
  • DiTTransformer2DModel
  • PixArtTransformer2DModel
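
For example, a sketch with deliberately tiny dimensions; only the exclusive_self_attention argument is new here:

from diffusers import DiTTransformer2DModel

# Hedged sketch: a tiny config just to show the opt-in flag, which is stored in
# the model config and propagated to the self-attention layers.
model = DiTTransformer2DModel(
    sample_size=8,
    patch_size=2,
    in_channels=4,
    num_layers=1,
    num_attention_heads=2,
    attention_head_dim=8,
    exclusive_self_attention=True,
)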

Deferred

This intentionally does not change added-KV processors, joint attention processors, direct-dispatch model-specific processors, or full U-Net constructor propagation. Those paths have more ambiguous token/value pairing semantics and are better handled in follow-up PRs if maintainers want broader coverage.

Validation

Ran:

PYTHONPATH=src python -m pytest tests/models/test_exclusive_self_attention.py
PYTHONPATH=src python -m pytest tests/models/test_layers_utils.py -k "test_spatial_transformer_default or exclusive_self_attention"
PYTHONPATH=src python -m pytest tests/models/transformers/test_models_dit_transformer2d.py -k "test_output or exclusive_self_attention"
PYTHONPATH=src python -m pytest tests/models/transformers/test_models_pixart_transformer2d.py -k "test_output or exclusive_self_attention"
python -m py_compile src/diffusers/models/attention_processor.py src/diffusers/models/attention.py src/diffusers/models/transformers/transformer_2d.py src/diffusers/models/transformers/dit_transformer_2d.py src/diffusers/models/transformers/pixart_transformer_2d.py tests/models/test_exclusive_self_attention.py tests/models/test_layers_utils.py tests/models/transformers/test_models_dit_transformer2d.py tests/models/transformers/test_models_pixart_transformer2d.py
uvx ruff check src/diffusers/models/attention_processor.py src/diffusers/models/attention.py src/diffusers/models/transformers/transformer_2d.py src/diffusers/models/transformers/dit_transformer_2d.py src/diffusers/models/transformers/pixart_transformer_2d.py tests/models/test_exclusive_self_attention.py tests/models/test_layers_utils.py tests/models/transformers/test_models_dit_transformer2d.py tests/models/transformers/test_models_pixart_transformer2d.py
uvx ruff format --check src/diffusers/models/attention_processor.py src/diffusers/models/attention.py src/diffusers/models/transformers/transformer_2d.py src/diffusers/models/transformers/dit_transformer_2d.py src/diffusers/models/transformers/pixart_transformer_2d.py tests/models/test_exclusive_self_attention.py tests/models/test_layers_utils.py tests/models/transformers/test_models_dit_transformer2d.py tests/models/transformers/test_models_pixart_transformer2d.py
git diff --check

taivu1998 marked this pull request as ready for review May 11, 2026 03:13

Successfully merging this pull request may close these issues.

Support for Exclusive Self Attention (XSA) in Diffusion Models (DiT, U-Net, etc.)
