Adding support for BigBird in GraphGPS #10598
Conversation
xnuohz
left a comment
Can you compare BigBird vs. Performer (or another attention type) on a dataset like ZINC, to see whether the metrics improve?
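For concreteness, such a comparison could be skeletoned as in the sketch below. This is only a rough sketch: `attn_type='bigbird'` assumes the option added in this PR, the rest follows the existing `GPSConv` API, and a real comparison would train each model to convergence and report MAE rather than just run a forward pass.

```python
import torch
from torch_geometric.datasets import ZINC
from torch_geometric.loader import DataLoader
from torch_geometric.nn import GINConv, GPSConv

dataset = ZINC(root='data/ZINC', subset=True, split='train')
loader = DataLoader(dataset, batch_size=32, shuffle=True)
emb = torch.nn.Embedding(28, 64)  # 28 atom types in ZINC

for attn_type in ['multihead', 'performer', 'bigbird']:  # 'bigbird' is new
    conv = GPSConv(channels=64, conv=GINConv(torch.nn.Linear(64, 64)),
                   heads=4, attn_type=attn_type)
    batch = next(iter(loader))
    x = emb(batch.x.squeeze(-1))
    out = conv(x, batch.edge_index, batch.batch)
    print(attn_type, out.shape)
```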
```python
# BigBird attention depends on the block size. The default value of 64 is
# too large for this test, so we set it to 1 with a single random block.
if conv.attn_type == 'bigbird':
    conv.attn = BigBirdAttention(channels=16, n_heads=4, block_size=1,
                                 num_rand_blocks=1)
```
Can we move this under GPSConv? When should the model use a large or a small block_size?
I had set the default block size to 64 assuming the input sequence length is very large (e.g., 1024 or 4096). A sequence length smaller than the block size would mean there are no blocks at all, which I assume is incorrect.
I added assertions for this, as done here.
Further, as an example, suppose the sequence length of the input to the transformer is 4, block_size is 1, and the number of random blocks I want to insert is also 1 (as in the case above). This gives 4 blocks, say B0, B1, B2, B3.
Note that random blocks cannot be allocated to a block's local window blocks or to the global blocks. Take block B1, for example (B0 and B3 are the global blocks, which attend to all other blocks anyway):
- its local window blocks are B0, B1 (itself), and B2
- the global blocks are B0 and B3
Thus, choosing 1 random block for B1 is not possible, so the random block list becomes empty and an error is thrown.
Thus, to handle this consistently, to allow random blocks to be inserted, and to ensure random blocks have both an incoming and an outgoing edge (not just an outgoing edge) in the blocked attention adjacency matrix, I fall back to dense attention if the sequence length is less than or equal to (2*num_random_blocks + 5)*block_size, as shown here.
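A minimal sketch of that check (the helper name is mine, not the one used in this PR):

```python
def needs_dense_fallback(seq_len: int, block_size: int,
                         num_rand_blocks: int) -> bool:
    # 2 global blocks plus a 3-block sliding window account for the 5;
    # the extra 2 * num_rand_blocks leaves enough candidate blocks for
    # random blocks to have both incoming and outgoing edges.
    return seq_len <= (2 * num_rand_blocks + 5) * block_size
```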
I am still quite unsure about these assertions and fallbacks; maybe there is a better way.
Also, is the above explanation of BigBird's sparse attention correct, or am I missing something?
Hi,
I just made a few changes to make it more concise.
I simplified the handling of low sequence lengths to a single constraint: it now automatically falls back to dense attention if the sequence length is less than or equal to (2*num_random_blocks + 5)*block_size.
Further, I added padding if the sequence length is not divisible by the block size :)
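The padding step is roughly the following (a sketch assuming a dense input of shape `[batch, seq_len, channels]`; the helper name is hypothetical):

```python
import torch
import torch.nn.functional as F

def pad_to_block_size(x: torch.Tensor, block_size: int):
    seq_len = x.size(1)
    pad_len = (-seq_len) % block_size  # 0 if already divisible
    if pad_len > 0:
        # Pad only the sequence dimension; the padded positions should be
        # masked out in the attention computation.
        x = F.pad(x, (0, 0, 0, pad_len))
    return x, pad_len
```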
I removed the above block from test_gps_conv.py and the tests still pass (they now fall back to dense attention, though, since the sequence length doesn't satisfy the above constraint).
Sure, will try to do this.
BigBird strictly falls back to dense attention if seq_length <= (5 + 2*num_rand_blocks) * block_size. Add appropriate padding to the input if the sequence length is not divisible by the block size.
Hi @xnuohz, instead I created random input graphs with varying numbers of nodes (1024, 2048, and so on) and created random edges for the graphs as well. The nodes have an embedding dimension of 1024.
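Roughly how such inputs can be generated (my sketch, not the exact benchmark script used for the plots):

```python
import torch
from torch_geometric.data import Data

def random_graph(num_nodes: int, avg_degree: int = 8, channels: int = 1024):
    edge_index = torch.randint(0, num_nodes, (2, num_nodes * avg_degree))
    x = torch.randn(num_nodes, channels)
    return Data(x=x, edge_index=edge_index)

graphs = [random_graph(n) for n in (1024, 2048, 4096, 8192)]
```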
Is this expected?
@Rahul1epic can you get results like this?
… attention in bigbird
Previously, the band mask was not being reassigned after its first dimension was unsqueezed (unsqueeze is not an in-place operation). This has been fixed.
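In essence, the bug pattern was the following (illustrative, not the PR's exact code): `Tensor.unsqueeze` returns a new tensor, so the result has to be assigned back.

```python
import torch

band_mask = torch.ones(4, 4)
band_mask.unsqueeze(0)              # bug: return value discarded, shape unchanged
band_mask = band_mask.unsqueeze(0)  # fix: reassign to keep the new leading dim
assert band_mask.shape == (1, 4, 4)
```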
Hi @xnuohz, also: since the number of nodes per graph in the ZINC dataset is small (around 20-40), BigBird always fell back to the original dense attention, as the sequence-length constraint was not satisfied for any sample. Are the results reasonable?


This PR implements BigBird attention in accordance with the paper https://arxiv.org/abs/2007.14062 and adds support for it in GraphGPS.
This PR closes issue #8320
Details
Note: Since sparse attention depends on the block_size and the number of random blocks, BigBird may fail if the sequence length is too short. When such cases arise, BigBird falls back to regular dense attention; this is currently happening in the gps_conv test. Further, since BigBird is designed for long sequences, I have set the default value of block_size to 64 and the number of random blocks to 3.
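A minimal usage sketch under this PR (the `attn_kwargs` keys follow this PR's `BigBirdAttention` signature and may differ in the final API):

```python
import torch
from torch_geometric.nn import GINConv, GPSConv

conv = GPSConv(
    channels=64,
    conv=GINConv(torch.nn.Linear(64, 64)),
    heads=4,
    attn_type='bigbird',  # added by this PR
    attn_kwargs={'block_size': 64, 'num_rand_blocks': 3},  # the defaults
)

x = torch.randn(4096, 64)                        # long enough to stay sparse
edge_index = torch.randint(0, 4096, (2, 32768))  # random edges
out = conv(x, edge_index)                        # all nodes form one sequence
```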