Why FlashAttention only on encoder? #7

Hi, thank you for providing this FlashAttention implementation of T5. I am wondering, however, why the code is set up so that the attention variants only work on the encoder and not on the decoder. See the relevant code below:

class T5LayerSelfAttention(nn.Module):
    def __init__(self, config, has_relative_attention_bias=False):
        super().__init__()
        if config.is_decoder:
            # decoder always uses T5Attention
            self.SelfAttention = T5Attention(config, ...)
        else:
            # encoder uses one of {T5FlashAttention, T5TritonBasicAttention, etc.}
            self.SelfAttention = T5ATTENTION_TYPES[config.attention_type](config, ...)
        ...

I plan to use only the decoder in my architecture, so I would like to know why the attention variants were not integrated there.
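
To make the request concrete, here is a minimal sketch of the kind of dispatch I have in mind. It is not the repository's code: `ATTENTION_TYPES`, `Config`, `supports_causal`, and the placeholder classes are hypothetical stand-ins written in plain Python (no PyTorch) so the selection logic can be run on its own. The one assumption I am making is that a variant used in the decoder would have to handle the causal mask that decoder self-attention requires, and that the layer would fall back to the default attention otherwise.

```python
from dataclasses import dataclass


class BaseAttention:
    """Placeholder for the default attention (stand-in for T5Attention)."""
    supports_causal = True

    def __init__(self, config):
        self.config = config


class FlashLikeAttention(BaseAttention):
    """Placeholder for a fused variant; assumed here to support causal masking."""
    supports_causal = True


class EncoderOnlyAttention(BaseAttention):
    """Placeholder for a variant that cannot apply a causal mask."""
    supports_causal = False


# Hypothetical registry, playing the role of T5ATTENTION_TYPES in the snippet above.
ATTENTION_TYPES = {
    "basic": BaseAttention,
    "flash": FlashLikeAttention,
    "encoder_only": EncoderOnlyAttention,
}


@dataclass
class Config:
    is_decoder: bool
    attention_type: str = "basic"


def select_self_attention(config):
    """Pick the attention object for a self-attention layer.

    Both encoder and decoder consult the registry; the decoder only falls
    back to the default when the chosen variant cannot be causal.
    """
    cls = ATTENTION_TYPES[config.attention_type]
    if config.is_decoder and not cls.supports_causal:
        return BaseAttention(config)
    return cls(config)
```

With something like this, `select_self_attention(Config(is_decoder=True, attention_type="flash"))` would give the decoder the fused variant, while a variant without causal support would quietly fall back to the default attention.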
