The linear projection after the self-attention is currently implemented as:
bs = self_attention.size(0)
self_attention = self_attention.view(bs, -1)
linear_proj = F.relu(self.linear_projection(self_attention))
The paper says: "We project the self-attended neighbor encodings to a LARGER 4x2d dimensional space". If the last two dimensions of "self_attention" are flattened before the projection, the input to the linear layer has size (number of neighbors) x 2d, so the output space of size 4x2d is only larger when the number of neighbors is less than 4. How can you make sure that is the case?
In my opinion, the last two dimensions should not be flattened before the projection. The projection should instead be applied to the last dimension, whose size is 2d. Since 2d < 4x2d, that really does project into a larger space.
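To make the difference concrete, here is a minimal sketch of the two variants. It assumes "self_attention" has shape (bs, num_neighbors, 2d); the sizes and the names proj_flat and proj_last are made up for illustration and only stand in for self.linear_projection.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sizes, chosen only for illustration.
bs, num_neighbors, d = 2, 10, 8
self_attention = torch.randn(bs, num_neighbors, 2 * d)

# Variant A (current code): flatten neighbors and features, then project.
flat = self_attention.view(bs, -1)                 # (bs, num_neighbors * 2d)
proj_flat = nn.Linear(num_neighbors * 2 * d, 4 * 2 * d)
out_flat = F.relu(proj_flat(flat))                 # (bs, 4 * 2d)

# Variant B (proposed): nn.Linear acts on the last dimension only,
# so each neighbor's 2d vector is projected to 4 * 2d on its own.
proj_last = nn.Linear(2 * d, 4 * 2 * d)
out_last = F.relu(proj_last(self_attention))       # (bs, num_neighbors, 4 * 2d)

print(flat.shape[-1], "->", out_flat.shape[-1])            # 160 -> 64
print(self_attention.shape[-1], "->", out_last.shape[-1])  # 16 -> 64
```

With 10 neighbors, variant A maps 160 features down to 64, while variant B maps 16 features up to 64 for every neighbor.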
Please correct me if I have misunderstood something, or let me know if the flattening is done on purpose for some reason.